Goto

Collaborating Authors

 user identifier


PolitiSky24: U.S. Political Bluesky Dataset with User Stance Labels

arXiv.org Artificial Intelligence

Stance detection identifies the viewpoint expressed in text toward a specific target, such as a political figure. While previous datasets have focused primarily on tweet-level stances from established platforms, user-level stance resources, especially on emerging platforms like Bluesky remain scarce. User-level stance detection provides a more holistic view by considering a user's complete posting history rather than isolated posts. We present the first stance detection dataset for the 2024 U.S. presidential election, collected from Bluesky and centered on Kamala Harris and Donald Trump. The dataset comprises 16,044 user-target stance pairs enriched with engagement metadata, interaction graphs, and user posting histories. PolitiSky24 was created using a carefully evaluated pipeline combining advanced information retrieval and large language models, which generates stance labels with supporting rationales and text spans for transparency. The labeling approach achieves 81\% accuracy with scalable LLMs. This resource addresses gaps in political stance analysis through its timeliness, open-data nature, and user-level perspective. The dataset is available at https://doi.org/10.5281/zenodo.15616911


UserIdentifier: Implicit User Representations for Simple and Effective Personalized Sentiment Analysis

arXiv.org Artificial Intelligence

Global models are trained to be as generalizable as possible, with user invariance considered desirable since the models are shared across multitudes of users. As such, these models are often unable to produce personalized responses for individual users, based on their data. Contrary to widely-used personalization techniques based on few-shot learning, we propose UserIdentifier, a novel scheme for training a single shared model for all users. Our approach produces personalized responses by adding fixed, non-trainable user identifiers to the input data. We empirically demonstrate that this proposed method outperforms the prefix-tuning based state-of-the-art approach by up to 13%, on a suite of sentiment analysis datasets. We also show that, unlike prior work, this method needs neither any additional model parameters nor any extra rounds of few-shot fine-tuning.


Scalable Select of Random Rows in SQL

@machinelearnbot

If you're new to the big data world and also migrating from tools like Google Analytics or Mixpanel for your web analytics, you probably noticed performance differences. Google Analytics can show you predefined reports in seconds, while the same query for the same data in your data warehouse can take several minutes or even more. Such performance boosts are achieved by selecting random rows or the sampling technique. Let's learn how to select random rows in SQL. Before we start to work on sampling implementation, it is worth mentioning some sampling fundamentals. Sampling is based on a subset selection of individuals from some population to describe this population's properties.